Analysis based on Global Financial Development, World Development Indicators and Government Expenditure datasets

Presented by - Madhurika Situt

Background

• The Global Financial Development Database is an extensive dataset of financial system characteristics for 214 economies.

• The World Development Indicators dataset presents the most current and accurate global development data, and includes national, regional and global estimates.

• The Statistics Division of the Food and Agriculture Organization of the United Nations annually collects data on Government Expenditure on Agriculture through a questionnaire developed in partnership with the International Monetary Fund.

Data Sources-

Global Financial Development dataset- This dataset is downloaded from https://datacatalog.worldbank.org/dataset/global-financial-development . It contains financial data for 203 economies from 1960 to 2017. The database includes measures of:

  1. Size of financial institutions and markets (financial depth)
  2. Degree to which individuals can and do use financial services (access)
  3. Efficiency of financial intermediaries and markets in intermediating resources and facilitating financial transactions (efficiency)
  4. Stability of financial institutions and markets (stability)

CSV files

  1. GFDDCountry.csv
  2. GFDDCountry-Series.csv
  3. GFDDData.csv (around 25,000 rows and around 60 columns)
  4. GFDDFootNote.csv
  5. GFDDSeries.csv

World Development Indicators dataset- This data set is downloaded from https://datacatalog.worldbank.org/dataset/world-development-indicators .

The primary World Bank collection of development indicators compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.

This dataset contains data for 217 economies from 1960 to 2019.

The database includes measures of- Agriculture and Food Security, Climate Change, Economic Growth, Education, Energy and Extractives, Environment and Natural Resources, Financial Sector Development, Gender, Health, Nutrition and Population, etc.

CSV files

  1. WDICountry.csv
  2. WDICountry-Series.csv
  3. WDIData.csv
  4. WDIFootNote.csv
  5. WDISeries.csv
  6. WDISeries-Time.csv

Government Expenditure dataset- This data set is downloaded from http://www.fao.org/faostat/en/#data/IG/metadata .

The Statistics Division of FAO annually collects data on Government Expenditure on Agriculture through a questionnaire developed in partnership with the International Monetary Fund. The IMF is the institution responsible for the Government Finance Statistics (GFS) methodology and annually collects GFS data, including Expenditure by Functions of Government (COFOG). The Classification of the Functions of Government (COFOG) is an international classification developed by the Organization for Economic Co-operation and Development (OECD) and published by the United Nations Statistical Division (UNSD), with the aim of categorizing government functions according to their purposes.

Time coverage - 2001-2018. Columns - 11. Rows - around 60,000.

Analysis goals-

  • How much have governments invested in agriculture over the years, and has the agriculture sector made progress worldwide using that investment?
  • Which top 3 countries have the highest investment in agriculture, and what are their returns?

  • Which top 3 countries have the highest returns in the agriculture sector, and in which year?

  • What percentage do agriculture and forestry contribute to a country's Gross National Income?

Here, for the above analysis, I am creating a function main() that takes the years and the list of countries for analysis as input from the user; all the analysis output will be saved in PDF format in a results.pdf file for future reference.
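A minimal sketch of that driver (the helper name parse_inputs and the analysis_*.pdf intermediate files are illustrative assumptions, not part of this notebook): each analysis step is assumed to save its figures as a one-page PDF, which PyPDF2's PdfFileMerger then concatenates into results.pdf.

```python
from typing import List, Tuple

def parse_inputs(years_csv: str, countries_csv: str) -> Tuple[List[int], List[str]]:
    """Parse comma-separated user input into a sorted year list and country
    names normalized the same way the dataframes are cleaned below
    (spaces removed, capitalized)."""
    years = sorted(int(y) for y in years_csv.split(',') if y.strip())
    countries = [c.strip().replace(' ', '').capitalize()
                 for c in countries_csv.split(',') if c.strip()]
    return years, countries

def main() -> None:
    years, countries = parse_inputs(input("Years (comma-separated): "),
                                    input("Countries (comma-separated): "))
    # ... run each analysis for `years` and `countries`, saving a one-page
    # PDF named 'analysis_<n>.pdf' (hypothetical naming scheme) ...
    import glob
    from PyPDF2 import PdfFileMerger  # available in PyPDF2 1.26 as installed above
    merger = PdfFileMerger()
    for page in sorted(glob.glob('analysis_*.pdf')):
        merger.append(page)
    with open('results.pdf', 'wb') as out:
        merger.write(out)
```

Normalizing the country input up front keeps user-typed names aligned with the cleaned country indexes used throughout the notebook.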

Machine learning models for predictions on-

  • Hypothesis 1- Does agriculture make an important contribution to Gross National Income for that country in that year? (Prediction will be Yes/No)
  • Hypothesis 2- Are the government’s investments in the agriculture and forestry sector profitable or not? (Prediction will be Yes/No)
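To train a classifier for Hypothesis 1, the Yes/No target first has to be derived from the data. One possible labelling rule, sketched below, marks agriculture as important when its value added exceeds a share threshold of GNI; the 10% cutoff is an illustrative assumption, not a figure from the datasets.

```python
import pandas as pd

THRESHOLD = 0.10  # assumed cutoff: agriculture share of GNI above 10% counts as important

def label_importance(agri_value: pd.Series, gni: pd.Series) -> pd.Series:
    """Return a 'Yes'/'No' label per (country, year) row for Hypothesis 1."""
    share = agri_value / gni
    return share.gt(THRESHOLD).map({True: 'Yes', False: 'No'})
```

These labels would then serve as the target column for a standard binary classifier.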

Libraries

In [1]:
#Loading libraries
import pandas as pd
import numpy as np
import openpyxl
import seaborn as sns
import plotly.graph_objects as go
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import string
import PyPDF2
import os
pd.set_option('display.float_format', lambda x: '%.5f' % x)
from plotly.subplots import make_subplots
from PyPDF2 import PdfFileWriter, PdfFileReader, PdfFileMerger
import glob
import sys
In [2]:
!pip install PyPDF2
Requirement already satisfied: PyPDF2 in c:\users\madhu\anaconda3\lib\site-packages (1.26.0)

Data Preparations

Global Financial Development dataset preparations

In [3]:
#Loading dataset using openpyxl to get sheetnames
wb = openpyxl.load_workbook('GFDD.xlsx')
wb.sheetnames
Out[3]:
['Data', 'Country', 'Series', 'Country-Series', 'FootNote']
In [4]:
#Reading each sheet into a separate dataframe

xls = pd.ExcelFile('GFDD.xlsx')
df_data = pd.read_excel(xls, 'Data')

#Printing the dataframe shape (rows, columns)
print("Rows and Columns\n")
print("\ndf_data shape :", df_data.shape)

#Printing the column names
print("Column names\n")
print("\ndf_data columns :", df_data.columns)
Rows and Columns


df_data shape : (25038, 62)
Column names


df_data columns : Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017'],
      dtype='object')
In [5]:
#Selecting rows whose indicator name is 'Gross National Income (current US$)'
df_GNI=df_data.loc[df_data['Indicator Name'].isin(['Gross National Income (current US$)'])]
#Resetting index
df_GNI=df_GNI.reset_index(drop=True)
print(df_GNI.shape)
df_GNI.head()
(214, 62)
Out[5]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
0 Afghanistan AFG Gross National Income (current US$) NY.GNP.MKTP.CD 550000000.00000 560000000.00000 560000000.00000 770000000.00000 820000000.00000 1000000000.00000 ... 10000000000.00000 12000000000.00000 16000000000.00000 18000000000.00000 20000000000.00000 21000000000.00000 20000000000.00000 20000000000.00000 19000000000.00000 20000000000.00000
1 Albania ALB Gross National Income (current US$) NY.GNP.MKTP.CD nan nan nan nan nan nan ... 13000000000.00000 12000000000.00000 12000000000.00000 13000000000.00000 12000000000.00000 13000000000.00000 13000000000.00000 12000000000.00000 12000000000.00000 13000000000.00000
2 Algeria DZA Gross National Income (current US$) NY.GNP.MKTP.CD 3200000000.00000 3000000000.00000 2200000000.00000 2800000000.00000 2900000000.00000 3100000000.00000 ... 170000000000.00000 140000000000.00000 160000000000.00000 200000000000.00000 210000000000.00000 210000000000.00000 210000000000.00000 160000000000.00000 160000000000.00000 160000000000.00000
3 Andorra AND Gross National Income (current US$) NY.GNP.MKTP.CD nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
4 Angola AGO Gross National Income (current US$) NY.GNP.MKTP.CD nan nan nan nan nan nan ... 75000000000.00000 63000000000.00000 76000000000.00000 100000000000.00000 120000000000.00000 130000000000.00000 140000000000.00000 110000000000.00000 96000000000.00000 120000000000.00000

5 rows × 62 columns

In [6]:
#Checking dataframe info
print("df_GNI info-\n", df_GNI.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 62 columns):
Country Name      214 non-null object
Country Code      214 non-null object
Indicator Name    214 non-null object
Indicator Code    214 non-null object
1960              10 non-null float64
1961              10 non-null float64
1962              11 non-null float64
1963              11 non-null float64
1964              11 non-null float64
1965              11 non-null float64
1966              11 non-null float64
1967              11 non-null float64
1968              11 non-null float64
1969              11 non-null float64
1970              11 non-null float64
1971              11 non-null float64
1972              11 non-null float64
1973              12 non-null float64
1974              12 non-null float64
1975              12 non-null float64
1976              12 non-null float64
1977              13 non-null float64
1978              13 non-null float64
1979              13 non-null float64
1980              16 non-null float64
1981              16 non-null float64
1982              15 non-null float64
1983              15 non-null float64
1984              15 non-null float64
1985              15 non-null float64
1986              17 non-null float64
1987              17 non-null float64
1988              17 non-null float64
1989              18 non-null float64
1990              20 non-null float64
1991              20 non-null float64
1992              20 non-null float64
1993              20 non-null float64
1994              20 non-null float64
1995              20 non-null float64
1996              103 non-null float64
1997              110 non-null float64
1998              110 non-null float64
1999              112 non-null float64
2000              116 non-null float64
2001              120 non-null float64
2002              122 non-null float64
2003              120 non-null float64
2004              128 non-null float64
2005              133 non-null float64
2006              138 non-null float64
2007              140 non-null float64
2008              139 non-null float64
2009              140 non-null float64
2010              139 non-null float64
2011              144 non-null float64
2012              145 non-null float64
2013              143 non-null float64
2014              142 non-null float64
2015              146 non-null float64
2016              139 non-null float64
2017              145 non-null float64
dtypes: float64(58), object(4)
memory usage: 103.8+ KB
df_GNI info-
 None

We don't need the 'Country Code', 'Indicator Name' and 'Indicator Code' columns, so we drop them.

In [7]:
#Dropping unwanted columns
print("Columns which we are dropping:\n",df_GNI.columns[1:4])
df_GNI=df_GNI.drop(df_GNI.columns[1:4],axis=1)
df_GNI.columns
Columns which we are dropping:
 Index(['Country Code', 'Indicator Name', 'Indicator Code'], dtype='object')
Out[7]:
Index(['Country Name', '1960', '1961', '1962', '1963', '1964', '1965', '1966',
       '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975',
       '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984',
       '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993',
       '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002',
       '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
       '2012', '2013', '2014', '2015', '2016', '2017'],
      dtype='object')
In [8]:
#Dropping rows that are completely null across the year columns (1960 to 2017)
temp_li=df_GNI.columns[1:]
df_GNI=df_GNI.dropna(how='all',subset=temp_li)
print("Number of null rows\n",(df_GNI.T).isnull().sum())
Number of null rows
 0      19
1      46
2       0
4      42
5      17
       ..
207    39
208    36
211    54
212    36
213     0
Length: 157, dtype: int64
In [9]:
#Checking for null values in df_GNI
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_GNI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da60d82898>
In [10]:
#Computing each row's minimum value, formatted to 2 decimals

list_GFDD_min=list(df_GNI[df_GNI.columns[1:]].min(axis=1))
list_GFDD_min=[ '%.2f' % elem for elem in list_GFDD_min]

#Filling null values with the corresponding minimum value, row wise
for i in range(df_GNI.shape[0]):
    df_GNI.iloc[i]=df_GNI.iloc[i].fillna(list_GFDD_min[i])
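The loop above fills each row's missing years with that row's minimum. An equivalent vectorized form, shown here on toy data rather than the GFDD frame, avoids the per-row iloc assignment:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the year columns of df_GNI
df = pd.DataFrame({'1960': [1.0, np.nan],
                   '1961': [np.nan, 5.0],
                   '1962': [3.0, 2.0]})

# Fill each row's NaNs with that row's minimum in one pass
filled = df.apply(lambda row: row.fillna(row.min()), axis=1)
```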
In [11]:
#Checking for null values in df_GNI
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_GNI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da5ee00320>
In [12]:
#Cleaning country name strings (removing spaces, capitalizing)
df_GNI['Country Name']=df_GNI['Country Name'].str.replace(' ','').str.capitalize()

#Setting Country name as index
df_GNI=df_GNI.set_index(df_GNI.columns[0])
df_GNI.index
Out[12]:
Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antiguaandbarbuda',
       'Argentina', 'Armenia', 'Aruba', 'Australia', 'Austria',
       ...
       'Unitedarabemirates', 'Unitedkingdom', 'Unitedstates', 'Uruguay',
       'Uzbekistan', 'Venezuela,rb', 'Vietnam', 'Yemen,rep.', 'Zambia',
       'Zimbabwe'],
      dtype='object', name='Country Name', length=157)
In [13]:
#Finding each country's maximum GNI from 1960 to 2017

# Changing datatype from object to float
df_GNI=df_GNI.astype(float)

#Taking the row-wise maximum across all year columns
df_GNI['Max_GNI']=df_GNI.max(axis=1)


df_GNI.head()
Out[13]:
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 ... 2009 2010 2011 2012 2013 2014 2015 2016 2017 Max_GNI
Country Name
Afghanistan 550000000.00000 560000000.00000 560000000.00000 770000000.00000 820000000.00000 1000000000.00000 1400000000.00000 1700000000.00000 1400000000.00000 1400000000.00000 ... 12000000000.00000 16000000000.00000 18000000000.00000 20000000000.00000 21000000000.00000 20000000000.00000 20000000000.00000 19000000000.00000 20000000000.00000 21000000000.00000
Albania 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 9200000000.00000 ... 12000000000.00000 12000000000.00000 13000000000.00000 12000000000.00000 13000000000.00000 13000000000.00000 12000000000.00000 12000000000.00000 13000000000.00000 13000000000.00000
Algeria 3200000000.00000 3000000000.00000 2200000000.00000 2800000000.00000 2900000000.00000 3100000000.00000 2900000000.00000 3300000000.00000 3800000000.00000 4300000000.00000 ... 140000000000.00000 160000000000.00000 200000000000.00000 210000000000.00000 210000000000.00000 210000000000.00000 160000000000.00000 160000000000.00000 160000000000.00000 210000000000.00000
Angola 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 14000000000.00000 ... 63000000000.00000 76000000000.00000 100000000000.00000 120000000000.00000 130000000000.00000 140000000000.00000 110000000000.00000 96000000000.00000 120000000000.00000 140000000000.00000
Antiguaandbarbuda 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 77000000.00000 ... 1200000000.00000 1100000000.00000 1100000000.00000 1200000000.00000 1100000000.00000 1200000000.00000 1300000000.00000 1400000000.00000 1400000000.00000 1400000000.00000

5 rows × 59 columns

In [14]:
#plotting in bar graph
fig = go.Figure(data=[
    go.Bar(name='Maximum GNI', x=df_GNI.index, y=df_GNI.Max_GNI)
])
# Change the bar mode
fig.update_layout(barmode='group',title_text='Maximum gross income in all economic countries',autosize=False,width=1000,height=600)
fig.show()

Government Expenditure dataset preparations

In [15]:
#Loading datasets

df_GI=pd.read_csv('Investment_GovernmentExpenditure_E_All_Data_(Normalized).csv',encoding='latin-1')
df_GI
Out[15]:
Area Code Area Item Code Item Element Code Element Year Code Year Unit Value Flag Note
0 2 Afghanistan 23130 Total outlays (General Government) 6109 Value Local Currency 2006 2006 millions 111274.57000 X IMF Government Finance Statistics (GFS)
1 2 Afghanistan 23130 Total outlays (General Government) 6109 Value Local Currency 2007 2007 millions 165029.87000 X IMF Government Finance Statistics (GFS)
2 2 Afghanistan 23130 Total outlays (General Government) 6109 Value Local Currency 2008 2008 millions 466732.04000 X IMF Government Finance Statistics (GFS)
3 2 Afghanistan 23130 Total outlays (General Government) 6109 Value Local Currency 2009 2009 millions 449927.62000 X IMF Government Finance Statistics (GFS)
4 2 Afghanistan 23130 Total outlays (General Government) 6109 Value Local Currency 2010 2010 millions 599141.98000 X IMF Government Finance Statistics (GFS)
... ... ... ... ... ... ... ... ... ... ... ... ...
65394 251 Zambia 23174 Environmental protection (Central Government) 6167 Value US$, 2010 prices 2013 2013 millions 1.06000 Qm NaN
65395 251 Zambia 23174 Environmental protection (Central Government) 6167 Value US$, 2010 prices 2014 2014 millions 0.82000 Qm NaN
65396 251 Zambia 23174 Environmental protection (Central Government) 6167 Value US$, 2010 prices 2015 2015 millions 0.94000 Qm NaN
65397 251 Zambia 23174 Environmental protection (Central Government) 6167 Value US$, 2010 prices 2016 2016 millions 15.26000 Qm NaN
65398 251 Zambia 23174 Environmental protection (Central Government) 6167 Value US$, 2010 prices 2017 2017 millions 17.50000 Qm NaN

65399 rows × 12 columns

In [16]:
#Checking df_GI info
df_GI.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65399 entries, 0 to 65398
Data columns (total 12 columns):
Area Code       65399 non-null int64
Area            65399 non-null object
Item Code       65399 non-null int64
Item            65399 non-null object
Element Code    65399 non-null int64
Element         65399 non-null object
Year Code       65399 non-null int64
Year            65399 non-null int64
Unit            65399 non-null object
Value           65399 non-null float64
Flag            65399 non-null object
Note            5712 non-null object
dtypes: float64(1), int64(5), object(6)
memory usage: 6.0+ MB
In [17]:
#Selecting rows whose item name is one of the agriculture-related categories below
df_GI=pd.DataFrame(df_GI.loc[df_GI['Item'].isin(['Agriculture, forestry, fishing (General Government)',
'Agriculture, forestry, fishing, Recurrent (General Government)',
'Agriculture, forestry, fishing, Capital (General Government)',
'Agriculture (General Government)',
'Agriculture, Recurrent (General Government)',
'Agriculture, Capital (General Government)',
'R&D Agriculture, forestry, fishing (General Government)',
'Agriculture, forestry, fishing, Recurrent (Central Government)',
'Agriculture, forestry, fishing, Capital (Central Government)',
'Agriculture (Central Government)',
'Agriculture, Recurrent (Central Government)',
'Agriculture, Capital (Central Government)',
'R&D Agriculture, forestry, fishing (Central Government)'

])])
#resetting index
df_GI=df_GI.reset_index(drop=True)
print(df_GI.shape)
df_GI
(23463, 12)
Out[17]:
Area Code Area Item Code Item Element Code Element Year Code Year Unit Value Flag Note
0 2 Afghanistan 23131 Agriculture, forestry, fishing (General Govern... 6109 Value Local Currency 2006 2006 millions 5213.30000 X IMF Government Finance Statistics (GFS)
1 2 Afghanistan 23131 Agriculture, forestry, fishing (General Govern... 6109 Value Local Currency 2007 2007 millions 12296.15000 X IMF Government Finance Statistics (GFS)
2 2 Afghanistan 23131 Agriculture, forestry, fishing (General Govern... 6109 Value Local Currency 2008 2008 millions 13868.60000 X IMF Government Finance Statistics (GFS)
3 2 Afghanistan 23131 Agriculture, forestry, fishing (General Govern... 6109 Value Local Currency 2009 2009 millions 19154.09000 X IMF Government Finance Statistics (GFS)
4 2 Afghanistan 23131 Agriculture, forestry, fishing (General Govern... 6109 Value Local Currency 2010 2010 millions 28051.22000 X IMF Government Finance Statistics (GFS)
... ... ... ... ... ... ... ... ... ... ... ... ...
23458 251 Zambia 23173 R&D Agriculture, forestry, fishing (Central Go... 6167 Value US$, 2010 prices 2010 2010 millions 5.63000 Qm NaN
23459 251 Zambia 23173 R&D Agriculture, forestry, fishing (Central Go... 6167 Value US$, 2010 prices 2014 2014 millions 4.89000 Qm NaN
23460 251 Zambia 23173 R&D Agriculture, forestry, fishing (Central Go... 6167 Value US$, 2010 prices 2015 2015 millions 4.27000 Qm NaN
23461 251 Zambia 23173 R&D Agriculture, forestry, fishing (Central Go... 6167 Value US$, 2010 prices 2016 2016 millions 2.86000 Qm NaN
23462 251 Zambia 23173 R&D Agriculture, forestry, fishing (Central Go... 6167 Value US$, 2010 prices 2017 2017 millions 7.72000 Qm NaN

23463 rows × 12 columns

In [18]:
#df_GI columns
df_GI.columns
Out[18]:
Index(['Area Code', 'Area', 'Item Code', 'Item', 'Element Code', 'Element',
       'Year Code', 'Year', 'Unit', 'Value', 'Flag', 'Note'],
      dtype='object')

We just need the 'Area', 'Year', 'Item' and 'Value' columns for our analysis, so we drop the rest of the columns.

In [19]:
# Removing unwanted columns
list_GI_columns=['Area Code', 'Item Code', 'Element Code', 'Element','Year Code', 'Unit', 'Flag', 'Note']
print("Columns which we are dropping:\n",df_GI[list_GI_columns].columns)
df_GI=df_GI.drop(list_GI_columns,axis=1)
df_GI.columns
Columns which we are dropping:
 Index(['Area Code', 'Item Code', 'Element Code', 'Element', 'Year Code',
       'Unit', 'Flag', 'Note'],
      dtype='object')
Out[19]:
Index(['Area', 'Item', 'Year', 'Value'], dtype='object')
In [20]:
#Grouping by Area and Year and summing the values across items
df_GI=df_GI.groupby(['Area','Year'])['Value'].sum().to_frame().reset_index()
df_GI
Out[20]:
Area Year Value
0 Afghanistan 2006 5477.81000
1 Afghanistan 2007 12887.08000
2 Afghanistan 2008 14488.07000
3 Afghanistan 2009 20010.43000
4 Afghanistan 2010 29263.64000
... ... ... ...
1626 Zambia 2013 5469.96000
1627 Zambia 2014 8768.37000
1628 Zambia 2015 18895.22000
1629 Zambia 2016 12815.14000
1630 Zambia 2017 15769.72000

1631 rows × 3 columns

In [21]:
# Creating unique lists of years and countries
list_GI_year=list(range(2001,2018))  # taking years only up to 2017

# Option: restrict to the top 5 countries having maximum GNI
#list_GI_country=['United States of America','China, mainland','Japan','Germany','United Kingdom' ]
list_GI_country=list(df_GI.Area.unique())

#sorting list
list_GI_year.sort()

list_GI_country

#list_GI_year
Out[21]:
['Afghanistan',
 'Albania',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Belarus',
 'Belgium',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Cabo Verde',
 'Canada',
 'Chile',
 'China',
 'Colombia',
 'Cook Islands',
 'Costa Rica',
 'Croatia',
 'Cyprus',
 'Czechia',
 "Côte d'Ivoire",
 'Democratic Republic of the Congo',
 'Denmark',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Estonia',
 'Eswatini',
 'Fiji',
 'Finland',
 'France',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Hungary',
 'Iceland',
 'Indonesia',
 'Ireland',
 'Israel',
 'Italy',
 'Jamaica',
 'Japan',
 'Jordan',
 'Kazakhstan',
 'Kenya',
 'Kuwait',
 'Kyrgyzstan',
 'Latvia',
 'Lebanon',
 'Lesotho',
 'Liberia',
 'Lithuania',
 'Luxembourg',
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Maldives',
 'Mali',
 'Malta',
 'Mauritius',
 'Mexico',
 'Mongolia',
 'Montenegro',
 'Morocco',
 'Mozambique',
 'Myanmar',
 'Namibia',
 'Nepal',
 'Netherlands',
 'New Caledonia',
 'New Zealand',
 'Niger',
 'Nigeria',
 'Norway',
 'Oman',
 'Pakistan',
 'Panama',
 'Paraguay',
 'Poland',
 'Portugal',
 'Republic of Korea',
 'Republic of Moldova',
 'Romania',
 'Russian Federation',
 'Rwanda',
 'Samoa',
 'San Marino',
 'Sao Tome and Principe',
 'Saudi Arabia',
 'Serbia',
 'Seychelles',
 'Sierra Leone',
 'Singapore',
 'Slovakia',
 'Slovenia',
 'South Africa',
 'South Sudan',
 'Spain',
 'Sri Lanka',
 'St. Kitts and Nevis',
 'St. Lucia',
 'St. Vincent and the Grenadines',
 'Sudan',
 'Sweden',
 'Switzerland',
 'Thailand',
 'Timor-Leste',
 'Togo',
 'Trinidad and Tobago',
 'Tunisia',
 'Turkey',
 'Ukraine',
 'United Arab Emirates',
 'United Kingdom',
 'United Republic of Tanzania',
 'United States',
 'Uzbekistan',
 'Vanuatu',
 'Vietnam',
 'Zambia']
In [22]:
#Checking for null values in df_GI
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_GI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da6549b518>
In [23]:
#Creating a draft dataframe indexed by year, with one column per country
df_GI_v1=pd.DataFrame(columns=list_GI_country, index=list_GI_year)
df_GI_v1.head()
Out[23]:
Afghanistan Albania Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas ... Turkey Ukraine United Arab Emirates United Kingdom United Republic of Tanzania United States Uzbekistan Vanuatu Vietnam Zambia
2001 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2002 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2003 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2004 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 139 columns

In [24]:
#Filling the draft dataframe df_GI_v1 with values from df_GI dataframe

for k in range(df_GI.shape[0]):
    for i in range(len(list_GI_year)):
        for j in range(len(list_GI_country)):        
            if df_GI['Year'].iloc[k]== list_GI_year[i] and df_GI['Area'].iloc[k] == list_GI_country[j]:
                df_GI_v1[df_GI_v1.columns[j]].iloc[i]=df_GI['Value'].iloc[k]
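The triple nested loop above rescans the whole frame for every (year, country) pair. Because the groupby left exactly one Value per (Area, Year) combination, the same table can be built in one step with pivot, sketched here on a toy frame:

```python
import pandas as pd

# Toy stand-in for df_GI after the groupby: one Value per (Area, Year) pair
df_GI = pd.DataFrame({'Area': ['Afghanistan', 'Afghanistan', 'Zambia'],
                      'Year': [2006, 2007, 2006],
                      'Value': [5477.81, 12887.08, 250.51]})
list_GI_year = list(range(2001, 2018))

# pivot builds the Year-by-Area table directly; reindex pads missing years with NaN
df_GI_v1 = df_GI.pivot(index='Year', columns='Area', values='Value').reindex(list_GI_year)
```

pivot raises an error if any (Area, Year) pair is duplicated, which doubles as a sanity check on the groupby.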
In [25]:
#df_GI_v1 columns and rows
df_GI_v1.shape
Out[25]:
(17, 139)
In [26]:
#df_GI_v1 info

df_GI_v1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 2001 to 2017
Columns: 139 entries, Afghanistan to Zambia
dtypes: object(139)
memory usage: 18.6+ KB
In [27]:
#Checking for null values in df_GI_v1
df_GI_v1.isnull().sum()
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_GI_v1.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da65b144e0>
In [28]:
#Dropping columns that are entirely null

df_GI_v1=df_GI_v1.dropna(axis='columns', how='all')

#Filling remaining null values with each column's minimum value

df_GI_v1=df_GI_v1.fillna(df_GI_v1.min())

df_GI_v1.shape
Out[28]:
(17, 137)
In [29]:
#Checking for null values in df_GI_v1
df_GI_v1.isnull().sum()
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_GI_v1.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da65d877f0>
In [30]:
#Cleaning country-name strings in the column labels (removing spaces, capitalizing)

df_GI_v1.columns=[c.replace(' ','').capitalize() for c in df_GI_v1.columns]

df_GI_v1.columns
Out[30]:
Index(['Afghanistan', 'Albania', 'Angola', 'Antiguaandbarbuda', 'Argentina',
       'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas',
       ...
       'Turkey', 'Ukraine', 'Unitedarabemirates', 'Unitedkingdom',
       'Unitedrepublicoftanzania', 'Unitedstates', 'Uzbekistan', 'Vanuatu',
       'Vietnam', 'Zambia'],
      dtype='object', length=137)
In [31]:
df_GI_v1.head()
Out[31]:
Afghanistan Albania Angola Antiguaandbarbuda Argentina Armenia Australia Austria Azerbaijan Bahamas ... Turkey Ukraine Unitedarabemirates Unitedkingdom Unitedrepublicoftanzania Unitedstates Uzbekistan Vanuatu Vietnam Zambia
2001 5477.81000 5065.02000 3360.78000 99.29000 2420.14000 103762.15000 26049.98000 4464.79000 995.79000 38.20000 ... 17047.12000 1742.68000 2663.36000 15707.65000 301833.06000 221113.48000 1400937.10000 416.72000 36679397.84000 250.51000
2002 5477.81000 5065.02000 3559.18000 99.29000 1279.47000 103762.15000 26471.37000 5000.40000 995.79000 38.90000 ... 17047.12000 1742.68000 2663.36000 9891.27000 198309.43000 160245.09000 1400937.10000 416.72000 36679397.84000 291.07000
2003 5477.81000 5065.02000 9983.17000 99.29000 1325.63000 103762.15000 29092.12000 5390.73000 995.79000 41.25000 ... 17047.12000 1742.68000 2663.36000 10928.36000 272494.43000 180041.10000 1400937.10000 416.72000 36679397.84000 476.24000
2004 5477.81000 5065.02000 44105.26000 99.29000 1709.11000 103762.15000 29765.39000 5188.26000 995.79000 126.60000 ... 17047.12000 4686.68000 2663.36000 12808.83000 972064.16000 155428.89000 1400937.10000 416.72000 36679397.84000 1118.59000
2005 5477.81000 5065.02000 98749.36000 99.29000 2007.39000 103762.15000 26094.02000 5288.02000 995.79000 140.44000 ... 17047.12000 5647.43000 2663.36000 14579.37000 564053.86000 222846.97000 1400937.10000 498.61000 36679397.84000 1538.67000

5 rows × 137 columns

In [32]:
#Optional: export df_GI_v1 to an Excel sheet for future reference
#output = pd.ExcelWriter('df_GI_v1.xlsx')  
#df_GI_v1.to_excel(output, sheet_name='sheet1')
#output.save()
In [33]:
#Plotting government expenditure on agriculture for each country with plotly
fig = go.Figure()
for i in range(df_GI_v1.shape[1]):
               fig.add_trace(go.Scatter(x=df_GI_v1.index, y=df_GI_v1[df_GI_v1.columns[i]], name=df_GI_v1.columns[i]))

#fig.add_trace(go.Scatter(x=df_GI_v1.index, y=df_GI_v1['Brazil'], name=df_GI_v1.columns[0]))
fig.update_layout(title_text='Government expenditure on Agriculture sector for different economies')
fig.show()

World Development Indicators dataset preparations

In [34]:
#Loading datasets

df_WDI=pd.read_csv('WDI_csv/WDIData.csv')
df_WDI
Out[34]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Arab World ARB 2005 PPP conversion factor, GDP (LCU per inter... PA.NUS.PPP.05 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
1 Arab World ARB 2005 PPP conversion factor, private consumptio... PA.NUS.PRVT.PP.05 nan nan nan nan nan nan ... nan nan nan nan nan nan nan nan nan nan
2 Arab World ARB Access to clean fuels and technologies for coo... EG.CFT.ACCS.ZS nan nan nan nan nan nan ... 82.36810 82.78329 83.12030 83.53346 83.89760 84.17160 84.51017 nan nan nan
3 Arab World ARB Access to electricity (% of population) EG.ELC.ACCS.ZS nan nan nan nan nan nan ... 86.00762 86.42827 87.07058 88.17684 87.34274 89.13012 89.67869 90.27369 nan nan
4 Arab World ARB Access to electricity, rural (% of rural popul... EG.ELC.ACCS.RU.ZS nan nan nan nan nan nan ... 73.46665 73.94210 75.24410 77.16230 75.53898 78.74115 79.66564 80.74929 nan nan
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
377779 Zimbabwe ZWE Women who believe a husband is justified in be... SG.VAW.NEGL.ZS nan nan nan nan nan nan ... nan 21.40000 nan nan nan 21.40000 nan nan nan nan
377780 Zimbabwe ZWE Women who believe a husband is justified in be... SG.VAW.REFU.ZS nan nan nan nan nan nan ... nan 16.90000 nan nan nan 14.50000 nan nan nan nan
377781 Zimbabwe ZWE Women who were first married by age 15 (% of w... SP.M15.2024.FE.ZS nan nan nan nan nan nan ... nan 3.90000 nan nan nan 3.70000 nan nan nan nan
377782 Zimbabwe ZWE Women who were first married by age 18 (% of w... SP.M18.2024.FE.ZS nan nan nan nan nan nan ... nan 30.50000 nan nan 33.50000 32.40000 nan nan nan nan
377783 Zimbabwe ZWE Women's share of population ages 15+ living wi... SH.DYN.AIDS.FE.ZS nan nan nan nan nan nan ... 58.90000 59.10000 59.30000 59.50000 59.60000 59.60000 59.70000 59.70000 59.80000 nan

377784 rows × 64 columns

Here, we keep only the specific columns and the specific indicator name ('Agriculture, forestry, and fishing, value added (current US$)') for our analysis.

In [35]:
#Keeping only the rows whose indicator name is 'Agriculture, forestry, and fishing, value added (current US$)'
df_WDI=pd.DataFrame(df_WDI.loc[df_WDI['Indicator Name'].isin(['Agriculture, forestry, and fishing, value added (current US$)'])])
#resetting index
df_WDI=df_WDI.reset_index(drop=True)
print(df_WDI.shape)
df_WDI.columns
(264, 64)
Out[35]:
Index(['Country Name', 'Country Code', 'Indicator Name', 'Indicator Code',
       '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018', '2019'],
      dtype='object')
In [36]:
# Removing unwanted columns
df_WDI_columns=['Country Code','Indicator Name','Indicator Code','2018','2019']
print("Columns which we are dropping:\n",df_WDI_columns)
df_WDI=df_WDI.drop(df_WDI_columns,axis=1)
df_WDI.columns
Columns which we are dropping:
 ['Country Code', 'Indicator Name', 'Indicator Code', '2018', '2019']
Out[36]:
Index(['Country Name', '1960', '1961', '1962', '1963', '1964', '1965', '1966',
       '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975',
       '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984',
       '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993',
       '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002',
       '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
       '2012', '2013', '2014', '2015', '2016', '2017'],
      dtype='object')
In [37]:
#Dropping rows whose year columns are all null
temp_WDI_list=df_WDI.columns[range(1,59)]
df_WDI=df_WDI.dropna(how='all',subset=temp_WDI_list)
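The `how='all'` / `subset` combination only drops a row when every listed year column is null; a minimal sketch with toy data (not the real WDI frame):

```python
import pandas as pd
import numpy as np

# Toy frame: country 'B' has no value in any year column, 'A' and 'C' each have one.
toy = pd.DataFrame({
    'Country Name': ['A', 'B', 'C'],
    '1960': [1.0, np.nan, np.nan],
    '1961': [np.nan, np.nan, 2.0],
})

# Drop a row only when ALL of the subset columns are null.
toy = toy.dropna(how='all', subset=['1960', '1961'])
# Only 'B' is removed.
```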
In [38]:
#df_WDI information
df_WDI.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 248 entries, 0 to 263
Data columns (total 59 columns):
Country Name    248 non-null object
1960            54 non-null float64
1961            54 non-null float64
1962            54 non-null float64
1963            56 non-null float64
1964            57 non-null float64
1965            77 non-null float64
1966            80 non-null float64
1967            81 non-null float64
1968            85 non-null float64
1969            86 non-null float64
1970            94 non-null float64
1971            95 non-null float64
1972            95 non-null float64
1973            95 non-null float64
1974            96 non-null float64
1975            101 non-null float64
1976            104 non-null float64
1977            110 non-null float64
1978            112 non-null float64
1979            113 non-null float64
1980            123 non-null float64
1981            135 non-null float64
1982            135 non-null float64
1983            138 non-null float64
1984            139 non-null float64
1985            141 non-null float64
1986            143 non-null float64
1987            145 non-null float64
1988            145 non-null float64
1989            151 non-null float64
1990            171 non-null float64
1991            172 non-null float64
1992            174 non-null float64
1993            180 non-null float64
1994            187 non-null float64
1995            213 non-null float64
1996            214 non-null float64
1997            218 non-null float64
1998            218 non-null float64
1999            219 non-null float64
2000            228 non-null float64
2001            229 non-null float64
2002            231 non-null float64
2003            232 non-null float64
2004            234 non-null float64
2005            233 non-null float64
2006            237 non-null float64
2007            235 non-null float64
2008            237 non-null float64
2009            237 non-null float64
2010            235 non-null float64
2011            236 non-null float64
2012            237 non-null float64
2013            238 non-null float64
2014            237 non-null float64
2015            236 non-null float64
2016            231 non-null float64
2017            219 non-null float64
dtypes: float64(58), object(1)
memory usage: 116.2+ KB
In [39]:
#Checking for null values in df_WDI
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_WDI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da685674e0>
In [40]:
#Filling null values with minimum values

df_WDI_min=list(df_WDI[df_WDI.columns[range(1,59)]].min(axis=1))
df_WDI_min=['%.2f' % elem for elem in df_WDI_min]

#Filling null values row-wise with the corresponding minimum value
for i in range(df_WDI.shape[0]):
    df_WDI.iloc[i]=df_WDI.iloc[i].fillna(df_WDI_min[i])

#Cleaning country names: removing spaces and capitalizing (vectorized to avoid chained-indexing warnings)
df_WDI['Country Name']=df_WDI['Country Name'].str.replace(' ','').str.capitalize()
In [41]:
# Checking again for null values in df_WDI after filling
fig, ax = plt.subplots(figsize=(20,5))         # Sample figsize in inches
sns.heatmap(df_WDI.isnull(),yticklabels=False,cbar=True,cmap='viridis', ax=ax)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1da0036f4e0>

Since the analysis spans three datasets, we keep only the countries that are available in all three.
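Under the hood this is a plain three-way set intersection; a toy sketch with made-up country lists:

```python
# Toy sketch: keep only countries present in all three sources.
gfdd = ['Brazil', 'Chile', 'Norway']
gi = ['Brazil', 'Norway', 'Kenya']
wdi = ['Norway', 'Brazil', 'Fiji']

common = sorted(set(gfdd) & set(gi) & set(wdi))
# common -> ['Brazil', 'Norway']
```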

In [42]:
#Extracting country lists from df_GNI, df_GI_v1 and df_WDI
GFDD_country_list=list(df_GNI.index)
GI_country_list=list(df_GI_v1.columns)
WDI_country_list=list(df_WDI['Country Name'].unique())
In [43]:
#Carrying out country list common in all three datasets
country_list=list(set(WDI_country_list) & set(GI_country_list) & set(GFDD_country_list))
country_list.sort()
print(country_list)
len(country_list)
['Afghanistan', 'Albania', 'Angola', 'Antiguaandbarbuda', 'Argentina', 'Armenia', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh', 'Belarus', 'Belgium', 'Benin', 'Bolivia', 'Botswana', 'Brazil', 'Bruneidarussalam', 'Bulgaria', 'Burkinafaso', 'Caboverde', 'Canada', 'Chile', 'China', 'Colombia', 'Costarica', 'Croatia', 'Cyprus', "Côted'ivoire", 'Denmark', 'Dominicanrepublic', 'Ecuador', 'Egypt', 'Elsalvador', 'Estonia', 'Fiji', 'Finland', 'France', 'Georgia', 'Germany', 'Ghana', 'Greece', 'Guatemala', 'Guinea', 'Guyana', 'Hungary', 'Indonesia', 'Ireland', 'Israel', 'Italy', 'Jamaica', 'Japan', 'Jordan', 'Kazakhstan', 'Kenya', 'Kuwait', 'Latvia', 'Lebanon', 'Lithuania', 'Luxembourg', 'Madagascar', 'Malawi', 'Malaysia', 'Mali', 'Malta', 'Mauritius', 'Mexico', 'Mongolia', 'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia', 'Nepal', 'Netherlands', 'Newzealand', 'Niger', 'Nigeria', 'Norway', 'Oman', 'Pakistan', 'Panama', 'Paraguay', 'Poland', 'Portugal', 'Romania', 'Russianfederation', 'Rwanda', 'Saudiarabia', 'Serbia', 'Sierraleone', 'Singapore', 'Slovenia', 'Southafrica', 'Spain', 'Srilanka', 'Sweden', 'Switzerland', 'Thailand', 'Togo', 'Trinidadandtobago', 'Tunisia', 'Turkey', 'Ukraine', 'Unitedarabemirates', 'Unitedkingdom', 'Unitedstates', 'Uzbekistan', 'Vietnam', 'Zambia']
Out[43]:
111

Updating all three dataframes to keep only the rows for the countries in country_list.
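The filtering pattern used for each dataframe below is the same: select rows whose index is in the common list with `isin`, then transpose so countries become columns and years become rows. A toy sketch:

```python
import pandas as pd

# Toy frame indexed by country, with year columns.
toy = pd.DataFrame({'2001': [1, 2, 3], '2002': [4, 5, 6]},
                   index=['Brazil', 'Norway', 'Fiji'])

# Keep only the common countries, then transpose: years become the index.
toy = toy.loc[toy.index.isin(['Brazil', 'Norway'])].T
```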

In [44]:
#Keeping, for each dataframe, only the rows for the countries present in country_list

# for df_GNI
df_GNI=(pd.DataFrame(df_GNI.loc[df_GNI.index.isin(country_list)])).T
print(df_GNI.shape)
df_GNI.head()
df_GNI=df_GNI.drop(index='Max_GNI')
# converting datatype of index to numbers
df_GNI.index=df_GNI.index.astype(int)
(59, 111)
In [45]:
df_GNI.head()
Out[45]:
Country Name Afghanistan Albania Angola Antiguaandbarbuda Argentina Armenia Australia Austria Azerbaijan Bahamas ... Trinidadandtobago Tunisia Turkey Ukraine Unitedarabemirates Unitedkingdom Unitedstates Uzbekistan Vietnam Zambia
1960 550000000.00000 9200000000.00000 14000000000.00000 77000000.00000 19000000000.00000 1100000000.00000 19000000000.00000 190000000000.00000 3100000000.00000 150000000.00000 ... 5200000000.00000 19000000000.00000 180000000000.00000 30000000000.00000 110000000000.00000 1400000000000.00000 8000000000000.00000 9500000000.00000 24000000000.00000 3200000000.00000
1961 560000000.00000 9200000000.00000 14000000000.00000 77000000.00000 19000000000.00000 1100000000.00000 20000000000.00000 190000000000.00000 3100000000.00000 170000000.00000 ... 5200000000.00000 19000000000.00000 180000000000.00000 30000000000.00000 110000000000.00000 1400000000000.00000 8000000000000.00000 9500000000.00000 24000000000.00000 3200000000.00000
1962 560000000.00000 9200000000.00000 14000000000.00000 77000000.00000 25000000000.00000 1100000000.00000 20000000000.00000 190000000000.00000 3100000000.00000 190000000.00000 ... 5200000000.00000 19000000000.00000 180000000000.00000 30000000000.00000 110000000000.00000 1400000000000.00000 8000000000000.00000 9500000000.00000 24000000000.00000 3200000000.00000
1963 770000000.00000 9200000000.00000 14000000000.00000 77000000.00000 19000000000.00000 1100000000.00000 21000000000.00000 190000000000.00000 3100000000.00000 210000000.00000 ... 5200000000.00000 19000000000.00000 180000000000.00000 30000000000.00000 110000000000.00000 1400000000000.00000 8000000000000.00000 9500000000.00000 24000000000.00000 3200000000.00000
1964 820000000.00000 9200000000.00000 14000000000.00000 77000000.00000 26000000000.00000 1100000000.00000 24000000000.00000 190000000000.00000 3100000000.00000 230000000.00000 ... 5200000000.00000 19000000000.00000 180000000000.00000 30000000000.00000 110000000000.00000 1400000000000.00000 8000000000000.00000 9500000000.00000 24000000000.00000 3200000000.00000

5 rows × 111 columns

In [46]:
#for df_GI
df_GI_v1=df_GI_v1[country_list]
df_GI_v1=df_GI_v1*1000000 #multiplying with 1000000 since unit of this dataset is in millions 
print(df_GI_v1.shape)
df_GI_v1.head()
(17, 111)
Out[46]:
Afghanistan Albania Angola Antiguaandbarbuda Argentina Armenia Australia Austria Azerbaijan Bahamas ... Trinidadandtobago Tunisia Turkey Ukraine Unitedarabemirates Unitedkingdom Unitedstates Uzbekistan Vietnam Zambia
2001 5477810000.00000 5065020000.00000 3360780000.00000 99290000.00000 2420140000.00000 103762150000.00003 26049980000.00000 4464790000.00000 995790000.00000 38200000.00000 ... 1081340000.00000 1822910000.00000 17047120000.00000 1742680000.00000 2663360000.00000 15707650000.00000 221113479999.99997 1400937099999.99976 36679397840000.00000 250510000.00000
2002 5477810000.00000 5065020000.00000 3559180000.00000 99290000.00000 1279470000.00000 103762150000.00003 26471370000.00000 5000400000.00000 995790000.00000 38900000.00000 ... 1148370000.00000 1822910000.00000 17047120000.00000 1742680000.00000 2663360000.00000 9891270000.00000 160245090000.00000 1400937099999.99976 36679397840000.00000 291070000.00000
2003 5477810000.00000 5065020000.00000 9983170000.00000 99290000.00000 1325630000.00000 103762150000.00003 29092120000.00000 5390730000.00000 995790000.00000 41250000.00000 ... 1193130000.00000 1822910000.00000 17047120000.00000 1742680000.00000 2663360000.00000 10928360000.00000 180041099999.99997 1400937099999.99976 36679397840000.00000 476240000.00000
2004 5477810000.00000 5065020000.00000 44105260000.00000 99290000.00000 1709110000.00000 103762150000.00003 29765390000.00000 5188260000.00000 995790000.00000 126600000.00000 ... 1383570000.00000 1822910000.00000 17047120000.00000 4686680000.00000 2663360000.00000 12808830000.00000 155428890000.00000 1400937099999.99976 36679397840000.00000 1118590000.00000
2005 5477810000.00000 5065020000.00000 98749360000.00002 99290000.00000 2007390000.00000 103762150000.00003 26094020000.00000 5288020000.00000 995790000.00000 140440000.00000 ... 1733890000.00000 1822910000.00000 17047120000.00000 5647430000.00000 2663360000.00000 14579370000.00000 222846969999.99997 1400937099999.99976 36679397840000.00000 1538670000.00000

5 rows × 111 columns

In [47]:
#for df_WDI

#Setting Country name as index
df_WDI=df_WDI.set_index(df_WDI.columns[0])

df_WDI=(pd.DataFrame(df_WDI.loc[df_WDI.index.isin(country_list)])).T
print(df_WDI.shape)
# converting datatype of index to numbers
df_WDI.index=df_WDI.index.astype(int)
(58, 111)
In [48]:
df_WDI.head()
Out[48]:
Country Name Afghanistan Albania Angola Antiguaandbarbuda Argentina Armenia Australia Austria Azerbaijan Bahamas ... Trinidadandtobago Tunisia Turkey Ukraine Unitedarabemirates Unitedkingdom Unitedstates Uzbekistan Vietnam Zambia
1960 1553453355.00 366374583.00 368504825.80 3852264.81 2420909129.00 1728457123.00 9920018457.00 2175389532.00 773631285.10 41700000.00 ... 48532928.89 176952381.00 7685984381.00000 3750203844.00 79519349.71 12110263423.00 106000000000.00 2884468878.00 2506941522.00 81714285.71000
1961 1553453355.00 366374583.00 368504825.80 3852264.81 2420909129.00 1728457123.00 9920018457.00 2175389532.00 773631285.10 41700000.00 ... 48532928.89 176952381.00 4133333333.00000 3750203844.00 79519349.71 12110263423.00 106000000000.00 2884468878.00 2506941522.00 88857142.86000
1962 1553453355.00 366374583.00 368504825.80 3852264.81 2420909129.00 1728457123.00 9920018457.00 2175389532.00 773631285.10 41700000.00 ... 48532928.89 176952381.00 4711111111.00000 3750203844.00 79519349.71 12110263423.00 106000000000.00 2884468878.00 2506941522.00 85571428.57000
1963 1553453355.00 366374583.00 368504825.80 3852264.81 2420909129.00 1728457123.00 9920018457.00 2175389532.00 773631285.10 41700000.00 ... 48532928.89 176952381.00 5522222222.00000 3750203844.00 79519349.71 12110263423.00 106000000000.00 2884468878.00 2506941522.00 93571428.57000
1964 1553453355.00 366374583.00 368504825.80 3852264.81 2420909129.00 1728457123.00 9920018457.00 2175389532.00 773631285.10 41700000.00 ... 48532928.89 176952381.00 5644444444.00000 3750203844.00 79519349.71 12110263423.00 106000000000.00 2884468878.00 2506941522.00 100571428.60000

5 rows × 111 columns

Analysis

Analysis 1: How much have governments invested in agriculture over the years, and has the agriculture sector made progress on that investment worldwide?

Since the government expenditure dataset only contains data from 2000 onwards, we will perform the analysis from 2001 to 2017 (17 years).

In the function below, years and countries are provided dynamically by the user. This analysis uses the Government expenditure (df_GI_v1) and World Development Indicators (df_WDI) datasets.

In [49]:
#How much government has invested in agriculture and carry out agriculture returns in all chosen countries

def analysis1(start,end,country_list):
    #keeping years as integers (inclusive of the end year) to match the integer year indexes of df_WDI and df_GI_v1
    year=list(range(start,end+1))
    
    #Carrying out specific data as per start year , end year and country list
    
    df_WDI_v1=pd.DataFrame(df_WDI.loc[df_WDI.index.isin(year)])
    df_WDI_v1=df_WDI_v1[country_list]

    df_GI_v2=pd.DataFrame(df_GI_v1.loc[df_GI_v1.index.isin(year)])
    df_GI_v2=df_GI_v2[country_list]
    
    # visualising investment (bars) and returns (lines)
    fig = go.Figure()
    for i in range(len(country_list)):
        fig.add_trace(go.Bar(name=country_list[i] + ' Government Investment on Agriculture', x=df_GI_v2.index, y=df_GI_v2[country_list[i]]))
        fig.add_trace(go.Scatter(name=country_list[i] + ' Agriculture Returns', x=df_WDI_v1.index, y=df_WDI_v1[country_list[i]]))
              
    # Change the bar mode
    fig.update_layout(barmode='group',title_text='Government investment on agriculture and its returns',autosize=True,width=1000,height=600,
                      xaxis = dict(tickmode = 'linear'      
                      
        ))   
   
    
    return fig
    
    

Analysis 2: Top 3 countries with the highest investment in agriculture, and their returns

In the function below, years and countries are provided dynamically by the user. This analysis uses the Government expenditure (df_GI_v1) and World Development Indicators (df_WDI) datasets.
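The top-3 selection inside the function below boils down to taking each country's column maximum over the chosen years and keeping the three largest; an equivalent minimal pandas sketch with toy numbers:

```python
import pandas as pd

# Toy frame: years as index, countries as columns.
toy = pd.DataFrame({'A': [1, 9], 'B': [3, 2], 'C': [8, 4], 'D': [5, 6]},
                   index=[2005, 2006])

# Column-wise maxima, then the three largest (equivalent to sort_values + head(3)).
top3 = toy.max().nlargest(3)
top3_countries = list(top3.index)
# top3_countries -> ['A', 'C', 'D']
```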

In [50]:
#Top 3 countries with the highest investment in agriculture and their returns
def analysis2(start_A2,end_A2,countries_A2):
    list_year_A2=list(range(start_A2,end_A2+1))
    print(list_year_A2)
    list_year_A2_GI_index=[]

  #carrying out index of years from government expenditure table
    for i in range(len(list_year_A2)):
        list_year_A2_GI_index.append(list(df_GI_v1.index).index((list_year_A2[i])))

  #carrying out index of years from agriculture returns table
    list_year_A2_WDI_index=[]
    for i in range(len(list_year_A2)):
        list_year_A2_WDI_index.append(list(df_WDI.index).index((list_year_A2[i])))
  

    temp_A2=[]
    for i in range(len(list_year_A2_GI_index)):
        temp_A2.append(pd.DataFrame(df_GI_v1[countries_A2].iloc[list_year_A2_GI_index[i]]))

    df_temp_A2=pd.DataFrame(index=countries_A2)

    for i in range(len(temp_A2)):
        df_temp_A2[(list(temp_A2[i].columns))[0]]=list(temp_A2[i][temp_A2[i].columns[0]])

    df_temp_A2=df_temp_A2.T

    list_year=[]
    list_GI_country=[]
    list_GI_max_values=[]
    list_Agri_return=[]

    temp=(pd.DataFrame(df_temp_A2.max()))
    temp.columns=['Max']
    temp=temp.sort_values('Max',ascending=False)
    temp=temp.head(3)
    list_GI_country=list(temp.index)

    #print(list_GI_country)
    #print(temp)

    #creating draft dataframe for analysis
    df_GI_WDI_columns=[list_GI_country[0]+'_Govt_invest',
                    list_GI_country[0]+'_Agri_return',
                    list_GI_country[1]+'_Govt_invest',
                    list_GI_country[1]+'_Agri_return',
                    list_GI_country[2]+'_Govt_invest',
                    list_GI_country[2]+'_Agri_return',
                    ]
    df_GI_WDI=pd.DataFrame(columns=df_GI_WDI_columns,index=df_temp_A2.index)
  #assigning values to dataframe
    for i in range(len(list_GI_country)):
        df_GI_WDI[list_GI_country[i]+'_Govt_invest']=list(df_temp_A2[list_GI_country[i]])
        for j in range(len(list_year_A2_WDI_index)):
            df_GI_WDI[list_GI_country[i]+'_Agri_return'].iloc[j]=df_WDI[list_GI_country[i]].iloc[list_year_A2_WDI_index[j]]
    
    #plotting analysis
    fig = make_subplots(rows=3, cols=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[0] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[0]]),row=1, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[1] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[1]]),row=1,col=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[2] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[2]]),row=2, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[3] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[3]]),row=2,col=2)
    fig.add_trace(go.Bar(name=df_GI_WDI.columns[4] + ' Government Investment on Agriculture', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[4]]),row=3, col=1)
    fig.add_trace(go.Scatter(name=df_GI_WDI.columns[5] + ' Agriculture Returns', x=df_GI_WDI.index, y=df_GI_WDI[df_GI_WDI.columns[5]]),row=3,col=2)

    fig.update_layout(height=1000, width=1500, title_text="Government Investment on Agriculture and its returns")

    return fig

Analysis 3: Top 3 countries with the highest returns in the agriculture sector, and in which year?

In the function below, years and countries are provided dynamically by the user. This analysis uses the World Development Indicators (df_WDI) dataset.

In [51]:
#Top 3 countries with the highest returns in the agriculture sector, and in which year

def analysis3(start_A3,end_A3,countries_A3):
    list_year_A3=list(range(start_A3,end_A3+1))
    list_year_A3_WDI_index=[]
    #carrying out index of years from agriculture returns table
    for i in range(len(list_year_A3)):
        list_year_A3_WDI_index.append(list(df_WDI.index).index((list_year_A3[i])))


    temp_A3=[]
    for i in range(len(list_year_A3_WDI_index)):
        temp_A3.append(pd.DataFrame(df_WDI[countries_A3].iloc[list_year_A3_WDI_index[i]]))
    # creating draft dataframe
    df_temp_A3=pd.DataFrame(index=countries_A3)

    for i in range(len(temp_A3)):
        df_temp_A3[(list(temp_A3[i].columns))[0]]=list(temp_A3[i][temp_A3[i].columns[0]])

    df_temp_A3=df_temp_A3.T
    df_temp_A3_v1=pd.DataFrame(df_temp_A3.max())
    df_temp_A3_v1.columns=['Max']
    df_temp_A3_v1=df_temp_A3_v1.sort_values('Max',ascending=False)
    df_temp_A3_v1=df_temp_A3_v1.head(3)

    #creating draft dataframe for analysis
 
    df_WDI_A3=pd.DataFrame(columns=df_temp_A3_v1.index,index=df_temp_A3.index)

    for i in range(df_WDI_A3.shape[1]):
        df_WDI_A3[df_WDI_A3.columns[i]]=list(df_temp_A3[df_WDI_A3.columns[i]])

    df_WDI_A3


    #carrying out total agriculture returns in particular year from df_WDI dataframe for years from list_year_A3

    list_total_agriculture=[]


    for i in range(len(list_year_A3_WDI_index)):
      #print(df_WDI.index[list_year_A3_WDI_index[i]])
      list_total_agriculture.append(sum([float(i) for i in list(df_WDI.iloc[list_year_A3_WDI_index[i]])]))

    # assigning values to new column 'Total_Agri' in df_WDI_A3
    df_WDI_A3['Total_Agri']=list_total_agriculture

    #Carrying out how much percentage of Total agriculture have these top 3 countries contributed

    for i in range(len(list(df_temp_A3_v1.index))):
        df_WDI_A3[list(df_temp_A3_v1.index)[i]+'_percentage']=round((df_WDI_A3[list(df_temp_A3_v1.index)[i]]/df_WDI_A3['Total_Agri'])*100,2)

    #plotting in bar plotly 
    import plotly.graph_objects as go
    x = df_WDI_A3
    fig = go.Figure()
    for i in range(len(list(df_temp_A3_v1.index))):
        fig.add_trace(go.Bar(x=list(x.index), y=list(x[x.columns[i]]) ,name=x.columns[i], text=[s + '%' for s in list(df_WDI_A3[list(df_temp_A3_v1.index)[i]+'_percentage'].astype(str))] ,textposition='auto'))

    fig.update_layout(barmode='relative', title_text='Percentage of total agriculture returns by Top 3 countries',autosize=True)

    
    return fig

Analysis 4: How much (in percentage) do agriculture and forestry contribute to the Gross National Income of each country?

In the function below, years and countries are provided dynamically by the user. This analysis uses the Gross national income (df_GNI) and World Development Indicators (df_WDI) datasets.
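The percentage computed in the function below is simply (agriculture value added / GNI) × 100, rounded to two decimals; a worked toy example:

```python
# Toy numbers: agriculture value added of 2.5e9 USD against a GNI of 5.0e10 USD.
agri_return = 2.5e9
gni = 5.0e10

share = round(agri_return / gni * 100, 2)  # percentage of GNI
# share -> 5.0 (i.e. agriculture contributes 5% of GNI)
```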

In [52]:
def analysis4(start_A4,end_A4,countries_A4):
    list_year_A4=list(range(start_A4,end_A4+1))
    list_year_A4_GNI_index=[]
    #carrying out index of years from agriculture returns table
    for i in range(len(list_year_A4)):
        list_year_A4_GNI_index.append(list(df_GNI.index).index((list_year_A4[i])))
    #carrying out index of years from agriculture returns table
    list_year_A4_WDI_index=[]
    for i in range(len(list_year_A4)):
        list_year_A4_WDI_index.append(list(df_WDI.index).index((list_year_A4[i])))
    #carrying out data in new dataframe 
    temp_A4=[]
    for i in range(len(list_year_A4_WDI_index)):
            temp_A4.append(pd.DataFrame(df_WDI[countries_A4].iloc[list_year_A4_WDI_index[i]]))
        
    df_temp_A4=pd.DataFrame(index=countries_A4)

    for i in range(len(temp_A4)):
            df_temp_A4[(list(temp_A4[i].columns))[0]]=list(temp_A4[i][temp_A4[i].columns[0]])

    df_temp_A4=df_temp_A4.T
    df_temp_A4

    #creating year column for ploting purpose
    df_temp_A4['Year']=list(df_temp_A4.index)

    #Carrying out percentage of GNI by each country's agriculture returns and assigning it in different column

    for i in range(len(countries_A4)):
        df_temp_A4[countries_A4[i]+'_%_in_GNI']=list(round(((df_temp_A4[countries_A4[i]])/(df_GNI[countries_A4[i]].iloc[list_year_A4_GNI_index]))*100,2))
    
    df_temp_A4

    #plotting in bar graph
    fig = go.Figure()
    table_values=[]

    #carrying out dataframe values to list of lists
    for i in range((df_temp_A4[df_temp_A4.columns[len(countries_A4):]]).shape[1]):
        table_values.append(((df_temp_A4[df_temp_A4.columns[len(countries_A4):]])[(df_temp_A4[df_temp_A4.columns[len(countries_A4):]]).columns[i]]).values.tolist())
    
    fig1 = go.Figure(data=[go.Table(header=dict(values=df_temp_A4.columns[len(countries_A4):]),
                               cells=dict(values=(table_values))
                               )])

    fig1.update_layout(title_text="Table for Percentage of GNI by each country's agriculture returns ",autosize=True,width=1000,height=600)   

    for i in range(len(countries_A4)):
        fig.add_trace(go.Scatter(name=countries_A4[i] + ' Agriculture Returns', x=df_temp_A4.index, y=df_temp_A4[countries_A4[i]],text=[s + '%' for s in list(df_temp_A4[countries_A4[i]+'_%_in_GNI'].astype(str))]))
              
    # Change the bar mode
    fig.update_layout(barmode='group',title_text="Percentage of GNI by each country's agriculture returns ",autosize=True,width=1000,height=600,
                      xaxis = dict(tickmode = 'linear')) 
    
    
    
    return fig1,fig

Standard helper functions for dynamic user inputs such as years, country list, etc.

In [53]:
def start_year():

    try:
        s_year=int(input('Enter start year from 2001 to 2017 -'))
        if s_year in range(2001,2018):
            return s_year
        else:
            print('Out of range. Please enter a correct year')
            return start_year()   #re-prompt and return the valid result

    except ValueError:
        print("Error! This is not a number. Try again.")
        return start_year()
        
                    
    
In [54]:
def end_year():
    try:
        e_year=int(input('Enter end year from 2001 to 2017 -'))
        if e_year in range(2001,2018):
            return e_year
        else:
            print('Out of range. Please enter a correct year')
            return end_year()   #re-prompt and return the valid result

    except ValueError:
        print("Error! This is not a number. Try again.")
        return end_year()
In [55]:
def number_of_countries():
    try:
        num=int(input('Enter number of countries you want to check -'))
        return num
    except ValueError:
        print("Error! This is not a number. Try again.")
        return number_of_countries()   #re-prompt and return the valid result

The function 'country_name_clean' below cleans the free-text country names provided by the user.

In [56]:
def country_name_clean(country_name):
    #removing white spaces
    country_name=country_name.replace(' ','')
    #removing punctuations
    country_name=country_name.translate(str.maketrans('', '', string.punctuation))
    # capitalizing first letter of string
    country_name=country_name.capitalize()          
    return country_name
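For example, the cleaning above maps a free-form entry like 'New Zealand' onto the 'Newzealand' spelling used as column names in the merged dataframes (self-contained copy of the function, for illustration):

```python
import string

def country_name_clean(country_name):
    country_name = country_name.replace(' ', '')  # remove white spaces
    # remove punctuation
    country_name = country_name.translate(str.maketrans('', '', string.punctuation))
    return country_name.capitalize()  # first letter upper, rest lower

# 'New Zealand' -> 'Newzealand', matching the cleaned column names
print(country_name_clean('New Zealand'))
```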
In [57]:
def country_list():
    country_list=[]
    num=int(number_of_countries())
    for i in range(num):
        country_name=str(input('Enter country name -'))
        if not country_name:   #input() returns an empty string, never None
            print('Enter a correct country name')
            return country_list()
        else:
            country_name=country_name_clean(country_name)
            if country_name in df_WDI.columns:
                country_list.append(country_name)
            else:
                print('Country name not found. Please try again.')
                main()
    return country_list,num
In [58]:
def merger(output_path, input_paths):
    pdf_writer = PdfFileWriter()
 
    for path in input_paths:
        pdf_reader = PdfFileReader(path, strict=False)
        for page in range(pdf_reader.getNumPages()):
            pdf_writer.addPage(pdf_reader.getPage(page))
 
    with open(output_path, 'wb') as fh:
        pdf_writer.write(fh)
 
In [59]:
def main():
    choice ='0'
    while choice =='0':
        start= (start_year())
        end= (end_year())
        while end < start:
            print('End year should be greater than start year')
            end= end_year()
        countries=[]
        countries,count= country_list()       
        def choose():
            print("\nMain Analysis menu: Choose 1 of 5 choices\n")
            print("\n1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries")
            print("2. Top 3 countries that has highest investment on agriculture and their returns")
            print("3. Top 3 countries that has highest returns on agriculture sector and in which year")
            print("4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country")
            print("5. Exit and carry out report in PDF format")
            
            choice=str(input("Please make a choice:"))
                       
            if choice == "5":
                print("\n\nThank you")
                print("Please open results.pdf")
                if __name__ == '__main__':
                    paths = glob.glob('*.pdf')
                    paths.sort()
                    merger('results.pdf', paths)
            elif choice == "4":
                fig_analysis41,fig_analysis42=analysis4(start,end,countries)
                fig_analysis41.show()
                fig_analysis42.show()
                fig_analysis41.write_image('analysis4_1.pdf',width='1200',height='600')
                fig_analysis42.write_image('analysis4_2.pdf',width='1200',height='600')
                choose()
            elif choice == "3":
                if count > 3:
                    fig_analysis3=analysis3(start,end,countries)
                    fig_analysis3.show()
                    fig_analysis3.write_image('analysis3.pdf',width='1200',height='600')
                    choose()
                else:
                    print("This is valid only for number of countries greater than 3")
                    choose()
            elif choice == "2":
                if count > 3:
                    fig_analysis2=analysis2(start,end,countries)
                    fig_analysis2.show()
                    fig_analysis2.write_image('analysis2.pdf',width='1200',height='600')
                    choose()
                else:
                    print("This is valid only for number of countries greater than 3")
                    choose()
                    
                    
            elif choice == "1":
                fig_analysis1=analysis1(start,end,countries)
                fig_analysis1.show()
                fig_analysis1.write_image('analysis1.pdf',width='1200',height='600')
                choose()
                
            else:
                print("I don't understand your choice.")
                choose()
            return      
        choose()
        choice='1'
    return

To run the analyses, I have created a menu-driven user interface where the user selects the analysis period by entering a start year and an end year between 2001 and 2017.

The interface also lets the user choose the countries to analyse, such as 'Australia', 'Austria', 'Argentina', 'Denmark', 'Belarus', 'Nigeria', 'Egypt'.

Finally, a menu option exports all the analyses to a PDF report ('results.pdf'), which can be used for reviewing or comparing them later.

Below is a simple example run of all the analyses:


In [60]:
main()
Enter start year from 2001 to 2017 -2005
Enter end year from 2001 to 2017 -2012
Enter number of countries you want to check -4
Enter country name -'Australia',
Enter country name -'Argentina', 
Enter country name -'Denmark',
Enter country name -'Egypt', 

Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:1
Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:2
[2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012]
Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:3
Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:
I don't understand your choice.

Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:4
Main Analysis menu: Choose 1 of 5 choices


1. How much government has invested in agriculture and carry out agriculture returns in all chosen countries
2. Top 3 countries that has highest investment on agriculture and their returns
3. Top 3 countries that has highest returns on agriculture sector and in which year
4. How much contribution (in percentage) does agriculture and forestry takes place in Gross National Income of the country
5. Exit and carry out report in PDF format
Please make a choice:5


Thank you
Please open results.pdf

Using machine learning algorithms to create models for the hypotheses below:

  1. Does agriculture make an important contribution to a country's Gross National Income in a given year? (Prediction: Yes/No)

  2. Are the government's investments in the agriculture and forestry sector profitable? (Prediction: Yes/No)

Since the dataset is small, I am using the k-nearest neighbors (k-NN) classification model for both hypotheses, as it gives better accuracy than the other models I tried.

k-nearest neighbors algorithm (k-NN)-

It is a supervised machine learning algorithm that relies on labeled input data to learn a function that produces an appropriate output when given new, unlabeled data.
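As a minimal illustration of this idea (on hypothetical toy data, not the report's datasets), a k-NN classifier labels a new point by majority vote among its k nearest labeled neighbors:

```python
# Toy k-NN example: two well-separated groups of one-feature points.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]  # one feature per sample
y = [0, 0, 0, 1, 1, 1]                              # two classes

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# 2.5 sits among the class-0 points, 10.5 among the class-1 points
print(clf.predict([[2.5], [10.5]]))  # → [0 1]
```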

Hypothesis 1- Does agriculture make an important contribution to the Gross National Income of a country in a given year? (Prediction: Yes/No)

This prediction is for an individual country based on its data over the years, so I prepare a model on Argentina's data.

In [61]:
#Additional libraries to create model
from sklearn.preprocessing import LabelEncoder #label encoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
In [62]:
# creating dataframe
df_hypothesis1=pd.DataFrame(index=df_GNI.index)
df_hypothesis1['Agriculture_returns']=df_WDI['Argentina'].astype(float)
df_hypothesis1['GNI']=df_GNI['Argentina']
df_hypothesis1['%_in_GNI']=round(((df_hypothesis1['Agriculture_returns']/df_hypothesis1['GNI'])*100),2)
df_hypothesis1['Target']=""

# Benchmark is 7.5%: if %_in_GNI is greater than 7.5%, then 'Yes', agriculture plays an important role in GNI, else 'No'
for i in range(df_hypothesis1.shape[0]):
    if df_hypothesis1['%_in_GNI'].iloc[i] > 7.5:
        df_hypothesis1.iloc[i, df_hypothesis1.columns.get_loc('Target')] = 'Yes'
    else:
        df_hypothesis1.iloc[i, df_hypothesis1.columns.get_loc('Target')] = 'No'
        
        
print(df_hypothesis1.shape)

df_hypothesis1.head()
(58, 4)
Out[62]:
Agriculture_returns GNI %_in_GNI Target
1960 2420909129.00000 19000000000.00000 12.74000 Yes
1961 2420909129.00000 19000000000.00000 12.74000 Yes
1962 2420909129.00000 25000000000.00000 9.68000 Yes
1963 2420909129.00000 19000000000.00000 12.74000 Yes
1964 2420909129.00000 26000000000.00000 9.31000 Yes
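As an aside, the row-by-row labeling loop above can also be written in vectorized form with `np.where`, which avoids pandas chained-indexing warnings; a sketch on toy data using the same column names:

```python
# Vectorized alternative to the row-by-row Target loop (toy data; the
# column names mirror the report's df_hypothesis1).
import numpy as np
import pandas as pd

df = pd.DataFrame({'%_in_GNI': [12.74, 9.68, 5.10, 7.50]})
# strictly greater than the 7.5% benchmark → 'Yes', otherwise 'No'
df['Target'] = np.where(df['%_in_GNI'] > 7.5, 'Yes', 'No')
print(df['Target'].tolist())  # → ['Yes', 'Yes', 'No', 'No']
```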
In [63]:
# carrying out features and targets

features = df_hypothesis1[df_hypothesis1.columns[range(0,3)]]
target= df_hypothesis1.iloc[:,-1]
# labeling target
le = LabelEncoder()
target=list(le.fit_transform(target))
print(target)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
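For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, so 'No' maps to 0 and 'Yes' to 1, which matches the encoded target printed above:

```python
# How LabelEncoder maps string labels: classes_ is sorted alphabetically.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['Yes', 'No', 'Yes'])
print(list(le.classes_))                 # → ['No', 'Yes']
print(list(encoded))                     # → [1, 0, 1]
print(list(le.inverse_transform([0])))   # → ['No']
```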
In [64]:
#Scaling features

# Load the standard scaler
sc = StandardScaler()

# Scale the feature data to be of mean 0 and of unit variance
features_scaled = sc.fit_transform(features)
print("features shape",features.shape)
#Creating df_feature_scaled to use it in plot
df_features_scaled=pd.DataFrame(features_scaled, columns=features.columns)
print("features_scaled shape",df_features_scaled.shape)
features shape (58, 3)
features_scaled shape (58, 3)
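`StandardScaler` standardizes each column as z = (x - mean) / std; a quick self-contained check that the scaled data has zero mean and unit variance:

```python
# Verify that StandardScaler output has mean 0 and unit (population) variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])
z = StandardScaler().fit_transform(x)
print(round(z.mean(), 6), round(z.std(), 6))  # → 0.0 1.0
```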
In [65]:
#Plotting the difference between scaled and unscaled training data
print("*Standard scaling removes the mean and scales the data to unit variance, as we can see in the plot.\n*However, as the outliers of each feature have different magnitudes, these variables are not scaled to the same range.\n*StandardScaler does not guarantee balanced feature scaling.")
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10))

ax1.set_title("Before Scaling")
for i in range(features.shape[1]):
    sns.kdeplot(features[features.columns[i]],ax=ax1)
ax2.set_title("After Standard Scaling")
for i in range(df_features_scaled.shape[1]):
    sns.kdeplot(df_features_scaled[df_features_scaled.columns[i]],ax=ax2)
*Standard scaling removes the mean and scales the data to unit variance, as we can see in the plot.
*However, as the outliers of each feature have different magnitudes, these variables are not scaled to the same range.
*StandardScaler does not guarantee balanced feature scaling.
In [66]:
# Splitting dataset
X_train, X_test, Y_train, Y_test = train_test_split(features_scaled,target,test_size=0.3,random_state=32)
#Creating X_train and X_test  dataframes
df_X_train = pd.DataFrame(X_train)
df_X_test = pd.DataFrame(X_test)
print(df_X_train.shape)
print(df_X_test.shape)
(40, 3)
(18, 3)
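A caveat worth noting: with only 58 rows, a plain random split can leave the Yes/No classes imbalanced between the train and test sets. Passing `stratify=target` to `train_test_split` preserves the class proportions in both splits; a sketch on toy labels of the same size:

```python
# Stratified split keeps the two classes roughly proportional in the test set.
from sklearn.model_selection import train_test_split

y = [0] * 30 + [1] * 28          # same size and rough balance as the 58-row target
X = [[i] for i in range(58)]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=32, stratify=y)
print(len(y_te), sum(y_te), len(y_te) - sum(y_te))  # test size and per-class counts
```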

Fitting k-nearest neighbors algorithm (k-NN)

In [67]:
#Setting parameters

knn_range = list(range(1,26))
In [68]:
#Checking the k value using GridSearchCV
param_knn = dict(n_neighbors=knn_range)

# Using GridSearchCV to find best parameters for KNN classifier model
grid_knnclf = GridSearchCV(KNeighborsClassifier(), param_knn, cv=10, scoring='accuracy')
grid_knnclf.fit(X_train, Y_train)
print(grid_knnclf.best_score_)
knn_best_param=grid_knnclf.best_params_
print(knn_best_param)
1.0
{'n_neighbors': 1}
DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal. [_search.py:814]
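A perfect cross-validation score at k=1 on a 40-row training set can be a sign of overfitting, so it is worth inspecting the mean CV score for every candidate k via `cv_results_` rather than only `best_params_`. A self-contained sketch on synthetic data (the fitted `grid_knnclf` above exposes the same attribute):

```python
# Inspect the full grid of mean CV accuracies, not just the winning k.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=40, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(),
                    {'n_neighbors': list(range(1, 11))}, cv=5)
grid.fit(X, y)

# mean CV accuracy per candidate k
for k, score in zip(range(1, 11), grid.cv_results_['mean_test_score']):
    print(k, round(score, 3))
```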
In [69]:
#fitting model

knn_model = KNeighborsClassifier(n_neighbors= knn_best_param['n_neighbors']) 
knn_model.fit(X_train, Y_train)
Out[69]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
In [70]:
# knn predictions
knn_predictions= knn_model.predict(X_test)
In [71]:
#Creating confusion matrix
lr_cm = confusion_matrix(Y_test, knn_predictions)
lr_cm=lr_cm[::-1,::-1]
print(lr_cm)
[[ 6  2]
 [ 0 10]]
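For context, scikit-learn's `confusion_matrix` orders rows and columns by label value, i.e. `[[tn, fp], [fn, tp]]` for labels {0, 1}; the `[::-1, ::-1]` flip above therefore yields `[[tp, fn], [fp, tn]]`, so the positions unpacked later by `ravel()` must be read in that flipped order. A small check:

```python
# sklearn's confusion_matrix layout and the effect of the [::-1, ::-1] flip.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
print(cm.tolist())              # → [[1, 1], [0, 2]]  i.e. [[tn, fp], [fn, tp]]
print(cm[::-1, ::-1].tolist())  # → [[2, 0], [1, 1]]  i.e. [[tp, fn], [fp, tn]]
```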
In [72]:
#Plotting confusion matrix
df_cm = pd.DataFrame(lr_cm, columns=np.unique(Y_test), index = np.unique(Y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
df_cm
#plt.figure(figsize = (7,5))

sns.set(font_scale=1)#for label size
ax=sns.heatmap(df_cm, cmap="Blues", annot=True, linewidths=.5, fmt='d')# font size
ax.invert_yaxis()
ax.invert_xaxis()
In [73]:
tn, fp, fn, tp = lr_cm.ravel()
print("True Negatives: ",tn)
print("False Positives: ",fp)
print("False Negatives: ",fn)
print("True Positives: ",tp)
print("========================================")
print("Confusion Matrix Interpretations –")
#Total number of prediction
print("Total number of predictions carried out by model:", tp+fp+fn+tn)
#Accuracy
Accuracy = (tn+tp)*100/(tp+tn+fp+fn) 
print("1. Accuracy:", round(Accuracy,2), ", which means", round(Accuracy,2) , "% of outcomes that were predicted by model are correct")
#Recall
Recall = (tp)*100/(tp+fn)
print("2. Recall (Sensitivity):",round(Recall,2), ", which means that when model predicts true positives" , round(Recall,2), "% of outcomes that were predicted by model are actually true positives.")
#Precision
Precision = (tp)*100 / (tp + fp)
print("3. Precision (Exactness):",round(Precision,2),", which means that when model predicts true positives," , round(Precision,2), "% of outcomes that were predicted by model are correct.")
#Specificity 
Specificity = (tn) * 100/(tn+fp)
print("4. Specificity:",round(Specificity,2),", which means that when model predicts true negatives," , round(Specificity,2), "% of outcomes that were predicted by model are correct.")
print("========================================")
True Negatives:  6
False Positives:  2
False Negatives:  0
True Positives:  10
========================================
Confusion Matrix Interpretations –
Total number of predictions carried out by model: 18
1. Accuracy: 88.89 , which means 88.89 % of outcomes that were predicted by model are correct
2. Recall (Sensitivity): 100.0 , which means that when model predicts true positives 100.0 % of outcomes that were predicted by model are actually true positives.
3. Precision (Exactness): 83.33 , which means that when model predicts true positives, 83.33 % of outcomes that were predicted by model are correct.
4. Specificity: 75.0 , which means that when model predicts true negatives, 75.0 % of outcomes that were predicted by model are correct.
========================================
In [74]:
print("Classification metrics:")
print(classification_report(Y_test,knn_predictions))
Classification metrics:
              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.75      0.86         8

    accuracy                           0.89        18
   macro avg       0.92      0.88      0.88        18
weighted avg       0.91      0.89      0.89        18

In [75]:
print(Y_test)
print(knn_predictions)
[0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0]
In [76]:
# implementing AUC

from sklearn.metrics import roc_curve,auc
y_scores = knn_model.predict_proba(X_test)
fpr, tpr, threshold = roc_curve(Y_test, y_scores[:, 1])
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()

An AUC of 0.88 means the model ranks a randomly chosen positive example above a randomly chosen negative one 88% of the time, i.e. it distinguishes well between the two classes.
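As a cross-check, scikit-learn's `roc_auc_score` computes the same area in a single call as the `roc_curve` + `auc` combination used above; a toy example with hypothetical scores:

```python
# roc_curve + auc and roc_auc_score agree on the area under the ROC curve.
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
fpr, tpr, _ = roc_curve(y_true, scores)
print(round(auc(fpr, tpr), 2))                   # → 0.75
print(round(roc_auc_score(y_true, scores), 2))   # → 0.75
```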

In [77]:
#Scores for K-nearest neighbors classifier
print("Accuracy score for K-nearest neighbors Classifier-%.2f" %accuracy_score(Y_test,knn_predictions))
print("F1 score for K-nearest neighbors Classifier-%.2f" %f1_score(Y_test,knn_predictions))
print("Precision score for K-nearest neighbors Classifier-%.2f" %precision_score(Y_test,knn_predictions))
print("Recall score for K-nearest neighbors Classifier-%.2f" %recall_score(Y_test,knn_predictions))
Accuracy score for K-nearest neighbors Classifier-0.89
F1 score for K-nearest neighbors Classifier-0.86
Precision score for K-nearest neighbors Classifier-1.00
Recall score for K-nearest neighbors Classifier-0.75

Based on the accuracy (0.89), precision (1.00) and recall (0.75) on the test set, the model performs well. This means the model can be used to predict whether agriculture returns make an important contribution to the Gross National Income of Argentina.

Similarly, models can be prepared for each country as per the requirements.
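To make that per-country repetition concrete, the Hypothesis 1 pipeline can be wrapped in a function. This is only a sketch on synthetic data; the function name, benchmark default and k are illustrative, not from the report:

```python
# Hypothetical per-country wrapper around the Hypothesis 1 pipeline.
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


def fit_country_model(returns, gni, benchmark=7.5, k=3):
    """Build the %_in_GNI feature, label it against the benchmark,
    and fit a k-NN classifier; returns (model, test accuracy)."""
    df = pd.DataFrame({'Agriculture_returns': returns, 'GNI': gni})
    df['%_in_GNI'] = round(df['Agriculture_returns'] / df['GNI'] * 100, 2)
    target = np.where(df['%_in_GNI'] > benchmark, 1, 0)
    features = StandardScaler().fit_transform(df)
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=0.3, random_state=32)
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    return model, accuracy_score(y_te, model.predict(X_te))


# synthetic stand-in for one country's 58-year series
rng = np.random.default_rng(0)
gni = rng.uniform(1e10, 5e10, size=58)
returns = gni * rng.uniform(0.03, 0.15, size=58)
model, acc = fit_country_model(returns, gni)
print(round(acc, 2))
```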

Hypothesis 2- Are the government’s investments in the agriculture and forestry sector profitable or not. (Prediction will be Yes/No)

In [78]:
# creating dataframe
df_hypothesis2=pd.DataFrame(index=df_WDI.index)
df_hypothesis2['Agriculture_returns']=df_WDI['Australia'].astype(float)
df_hypothesis2['GI']=np.nan
for i in range(df_GI_v1.shape[0]):
    df_hypothesis2.iloc[41+i, df_hypothesis2.columns.get_loc('GI')]=df_GI_v1['Australia'].iloc[i]
df_hypothesis2['GI']=df_hypothesis2['GI'].fillna(df_hypothesis2['GI'].min())
df_hypothesis2['GI']=df_hypothesis2['GI'].astype(float)
df_hypothesis2['%_in_GNI']=round(((df_hypothesis2['Agriculture_returns']/df_hypothesis2['GI'])*100),2)
df_hypothesis2['Target']=""

# Benchmark is 50%: if agriculture returns exceed 50% of government investment (GI), the investment is considered profitable ('Yes'), else 'No'
for i in range(df_hypothesis2.shape[0]):
    if df_hypothesis2['%_in_GNI'].iloc[i] > 50:
        df_hypothesis2.iloc[i, df_hypothesis2.columns.get_loc('Target')] = 'Yes'
    else:
        df_hypothesis2.iloc[i, df_hypothesis2.columns.get_loc('Target')] = 'No'
        
        
print(df_hypothesis2.shape)

df_hypothesis2.head()
(58, 4)
Out[78]:
Agriculture_returns GI %_in_GNI Target
1960 9920018457.00000 26049980000.00000 38.08000 No
1961 9920018457.00000 26049980000.00000 38.08000 No
1962 9920018457.00000 26049980000.00000 38.08000 No
1963 9920018457.00000 26049980000.00000 38.08000 No
1964 9920018457.00000 26049980000.00000 38.08000 No
In [79]:
# carrying out features and targets

features2 = df_hypothesis2[df_hypothesis2.columns[range(0,3)]]
target2= df_hypothesis2.iloc[:,-1]
# labeling target
le = LabelEncoder()
target2=list(le.fit_transform(target2))
print(target2)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1]
In [80]:
#Scaling features

# Load the standard scaler
sc = StandardScaler()

# Scale the feature data to be of mean 0 and of unit variance
features2_scaled = sc.fit_transform(features2)

print("features shape",features2.shape)
#Creating df_feature_scaled to use it in plot
df_features2_scaled=pd.DataFrame(features2_scaled, columns=features2.columns)
print("features_scaled shape",df_features2_scaled.shape)
features shape (58, 3)
features_scaled shape (58, 3)
In [81]:
#Plotting the difference between scaled training data and without scaled data

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 10))

ax1.set_title("Before Scaling")
for i in range(features2.shape[1]):
    sns.kdeplot(features2[features2.columns[i]],ax=ax1)
ax2.set_title("After Standard Scaling")
for i in range(df_features2_scaled.shape[1]):
    sns.kdeplot(df_features2_scaled[df_features2_scaled.columns[i]],ax=ax2)
In [82]:
# Splitting dataset
X_train2, X_test2, Y_train2, Y_test2 = train_test_split(features2_scaled,target2,test_size=0.2, random_state=43)
#Creating X_train and X_test  dataframes
df_X_train2 = pd.DataFrame(X_train2)
df_X_test2 = pd.DataFrame(X_test2)
print(df_X_train2.shape)
print(df_X_test2.shape)
(46, 3)
(12, 3)
In [83]:
# Using GridSearchCV to find best parameters for KNN classifier model
grid_knnclf2 = GridSearchCV(KNeighborsClassifier(), param_knn, cv=10, scoring='accuracy')
grid_knnclf2.fit(X_train2, Y_train2)
print(grid_knnclf2.best_score_)
knn_best_param2=grid_knnclf2.best_params_
print(knn_best_param2)
0.9130434782608695
{'n_neighbors': 5}
DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal. [_search.py:814]
In [84]:
#fitting model

knn_model2 = KNeighborsClassifier(n_neighbors= knn_best_param2['n_neighbors']) 
knn_model2.fit(X_train2, Y_train2)
Out[84]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
In [85]:
# knn predictions
knn_predictions2= knn_model2.predict(X_test2)
In [86]:
#Creating confusion matrix
lr_cm2 = confusion_matrix(Y_test2, knn_predictions2)
lr_cm2=lr_cm2[::-1,::-1]
print(lr_cm2)
[[3 2]
 [0 7]]
In [87]:
#Plotting confusion matrix
df_cm2 = pd.DataFrame(lr_cm2, columns=np.unique(Y_test2), index = np.unique(Y_test2))
df_cm2.index.name = 'Actual'
df_cm2.columns.name = 'Predicted'
df_cm2
#plt.figure(figsize = (7,5))

sns.set(font_scale=1)#for label size
ax=sns.heatmap(df_cm2, cmap="Spectral", annot=True, linewidths=.5, fmt='d')# font size
ax.invert_yaxis()
ax.invert_xaxis()
In [88]:
tn2, fp2, fn2, tp2 = lr_cm2.ravel()
print("True Negatives: ",tn2)
print("False Positives: ",fp2)
print("False Negatives: ",fn2)
print("True Positives: ",tp2)
print("========================================")
print("Confusion Matrix Interpretations –")
#Total number of prediction
print("Total number of predictions carried out by model:", tp2+fp2+fn2+tn2)
#Accuracy
Accuracy2 = (tn2+tp2)*100/(tp2+tn2+fp2+fn2) 
print("1. Accuracy:", round(Accuracy2,2), ", which means", round(Accuracy2,2) , "% of outcomes that were predicted by model are correct")
#Recall
Recall2 = (tp2)*100/(tp2+fn2)
print("2. Recall (Sensitivity):",round(Recall2,2), ", which means that when model predicts true positives" , round(Recall2,2), "% of outcomes that were predicted by model are actually true positives.")
#Precision
Precision2 = (tp2)*100 / (tp2 + fp2)
print("3. Precision (Exactness):",round(Precision2,2),", which means that when model predicts true positives," , round(Precision2,2), "% of outcomes that were predicted by model are correct.")
#Specificity 
Specificity2 = (tn2) * 100/(tn2+fp2)
print("4. Specificity:",round(Specificity2,2),", which means that when model predicts true negatives," , round(Specificity2,2), "% of outcomes that were predicted by model are correct.")
print("========================================")
True Negatives:  3
False Positives:  2
False Negatives:  0
True Positives:  7
========================================
Confusion Matrix Interpretations –
Total number of predictions carried out by model: 12
1. Accuracy: 83.33 , which means 83.33 % of outcomes that were predicted by model are correct
2. Recall (Sensitivity): 100.0 , which means that when model predicts true positives 100.0 % of outcomes that were predicted by model are actually true positives.
3. Precision (Exactness): 77.78 , which means that when model predicts true positives, 77.78 % of outcomes that were predicted by model are correct.
4. Specificity: 60.0 , which means that when model predicts true negatives, 60.0 % of outcomes that were predicted by model are correct.
========================================
In [89]:
print("Classification metrics:")
print(classification_report(Y_test2,knn_predictions2))
Classification metrics:
              precision    recall  f1-score   support

           0       0.78      1.00      0.88         7
           1       1.00      0.60      0.75         5

    accuracy                           0.83        12
   macro avg       0.89      0.80      0.81        12
weighted avg       0.87      0.83      0.82        12

In [90]:
# implementing AUC

from sklearn.metrics import roc_curve,auc
y_scores2 = knn_model2.predict_proba(X_test2)
fpr2, tpr2, threshold2 = roc_curve(Y_test2, y_scores2[:, 1])
roc_auc2 = auc(fpr2, tpr2)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr2, tpr2, 'b', label = 'AUC = %0.2f' % roc_auc2)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.title('ROC Curve of kNN')
plt.show()

With AUC = 1.0, the model separates the two classes perfectly on this test set, although with only 12 test samples this estimate should be treated cautiously.

In [91]:
print(Y_test2)
print(knn_predictions2)
In [92]:
#Scores for K-nearest neighbors classifier
print("Accuracy score for K-nearest neighbors Classifier-%.2f" %accuracy_score(Y_test2,knn_predictions2))
print("F1 score for K-nearest neighbors Classifier-%.2f" %f1_score(Y_test2,knn_predictions2))
print("Precision score for K-nearest neighbors Classifier-%.2f" %precision_score(Y_test2,knn_predictions2))
print("Recall score for K-nearest neighbors Classifier-%.2f" %recall_score(Y_test2,knn_predictions2))
Accuracy score for K-nearest neighbors Classifier-0.83
F1 score for K-nearest neighbors Classifier-0.75
Precision score for K-nearest neighbors Classifier-1.00
Recall score for K-nearest neighbors Classifier-0.60